gh-144888: Replace bloom filter linked lists with continuous arrays to optimize executor invalidating performance #145873
cocolato wants to merge 2 commits into python:main
markshannon
left a comment
Thanks for doing this.
I've only had time to do a quick scan, but this looks like it should speed up the scan considerably.
```diff
     _PyBloomFilter bloom;
-    _PyExecutorLinkListNode links;
+    int32_t bloom_array_idx;  // Index in interp->executor_blooms/executor_ptrs.
+    _PyExecutorLinkListNode links;  // Used by deletion list.
```
Is this necessary now? We can traverse all executors using the executor_ptrs array.
I think we still need it to maintain the deletion list:
Lines 332 to 338 in 08a018e
@Fidget-Spinner gentle ping: if you have time, please take a look at this. Thanks!
Do you have benchmarks for this? A microbenchmark is fine.
I wrote a microbench:

bench.py:

```python
import time

N = 1000
ROUNDS = 10

for r in range(ROUNDS):
    # Build N classes, each with its own function that reads cls.val in a loop.
    classes = []
    for i in range(N):
        cls = type(f"C{i}", (), {"val": i})
        ns = {"cls": cls}
        exec("def f(n):\n o=cls()\n s=0\n for j in range(n): s+=o.val\n return s", ns)
        classes.append((cls, ns["f"]))
    # Warm up every function so it becomes hot enough for the JIT.
    for _, f in classes:
        for _ in range(200):
            f(10)
    # Time only the attribute writes, which trigger the invalidation scan.
    t0 = time.perf_counter_ns()
    for cls, _ in classes:
        cls.val = -1
    elapsed = time.perf_counter_ns() - t0
    print(f"round {r}: {elapsed / 1e3:.1f} us ({elapsed // N} ns/scan)")
```

test.sh:

result:
However, since the time spent on the scan is so small compared to warmup, I did not observe any noticeable performance improvement in fastmark.
That's an excellent result!
With the JIT enabled, when function objects are destroyed or code objects are modified, all executors must be traversed to inspect their dependencies, and the affected executors are then invalidated. The original implementation stored executors in linked lists, so each traversal chased many pointers and had poor CPU cache efficiency.
This PR changes the executor storage structure from a linked list to a contiguous array, reducing pointer jumps during traversal to improve CPU cache efficiency. It also implements O(1) deletion using swap-remove, thereby accelerating dependency invalidation operations.
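
To make the idea concrete, here is a minimal pure-Python model of the approach described above. It is a sketch, not the C implementation in this PR. `ExecutorTable`, `Executor`, `bloom_mask`, and the toy hash are hypothetical names invented for illustration. The `blooms`/`ptrs` lists and `array_idx` only loosely mirror `executor_blooms`/`executor_ptrs` and `bloom_array_idx` from the diff.

```python
BLOOM_BITS = 64  # assumption: toy filter width, not CPython's


def bloom_mask(key: int) -> int:
    """Return a bitmask with three bits set for `key` (toy hash, for illustration only)."""
    mask = 0
    for mul, add in ((1, 0), (7, 3), (13, 5)):
        mask |= 1 << ((key * mul + add) % BLOOM_BITS)
    return mask


class Executor:
    """Hypothetical stand-in for a JIT executor."""

    def __init__(self, name, deps):
        self.name = name
        self.valid = True
        self.array_idx = -1  # position in the table, -1 when not stored
        self.bloom = 0
        for dep in deps:
            self.bloom |= bloom_mask(dep)


class ExecutorTable:
    """Parallel arrays: blooms[i] is the filter of ptrs[i]."""

    def __init__(self):
        self.blooms = []
        self.ptrs = []

    def add(self, ex):
        ex.array_idx = len(self.ptrs)
        self.blooms.append(ex.bloom)
        self.ptrs.append(ex)

    def remove(self, ex):
        """O(1) swap-remove: move the last entry into the freed slot."""
        i, last = ex.array_idx, len(self.ptrs) - 1
        if i != last:
            moved = self.ptrs[last]
            self.ptrs[i] = moved
            self.blooms[i] = self.blooms[last]
            moved.array_idx = i
        self.ptrs.pop()
        self.blooms.pop()
        ex.array_idx = -1

    def invalidate(self, key):
        """Scan the bloom array and drop every executor that may depend on `key`."""
        probe = bloom_mask(key)
        invalidated = []
        i = 0
        while i < len(self.ptrs):
            if (self.blooms[i] & probe) == probe:
                ex = self.ptrs[i]
                ex.valid = False
                self.remove(ex)  # slot i now holds a different entry, so do not advance
                invalidated.append(ex)
            else:
                i += 1
        return invalidated


# Usage: four executors, then invalidate everything that may depend on key 2.
table = ExecutorTable()
execs = {n: Executor(n, d) for n, d in [("a", [1, 2]), ("b", [2, 3]), ("c", [4]), ("d", [1])]}
for ex in execs.values():
    table.add(ex)
gone = table.invalidate(2)
print([e.name for e in gone], [e.name for e in table.ptrs])
```

The scan touches only the compact `blooms` list until a filter matches, which is the part that benefits from contiguous storage in C. Because swap-remove changes the order of the remaining entries, each stored object has to remember its own index, which is the role `bloom_array_idx` plays in the diff.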